Created by Lavita Singhania
This project aims to identify potential borrowers and segment them effectively for better credit disbursal. It also provides insights into the creditworthiness of these segments.
This project uses an unsupervised learning algorithm, K-means clustering, to segment the unlabelled data.
The data consists of 1000 customers and 21 attributes for each customer.
Each entry represents a person who takes credit from the bank.
Important variables of the data
Project Summary
Analyzing cluster behavior
Targetable customer base
Project Working
Loading libraries
Data processing
Data cleaning
Checking for missing values
Exploratory analysis
Distribution plots
Boxplots
Scatter matrix
K - Means Clustering for Segmentation
Data Preprocessing for clustering
Finding the optimal number of clusters using the elbow method
Cluster visualization
Other Insights on Cluster Behavior
#Mean values of the clustering variables and the response in each cluster
#(a list of columns is selected with double brackets)
grouped = german_data.groupby('Cluster', as_index=False)[['age', 'duration', 'amount', 'response']].mean().round(2)
print(grouped)
Comments:
- Cluster 0 is the group of older customers who take smaller loans for shorter durations
- Cluster 1 is the group of young customers who take moderate loans for longer durations
- Cluster 2 is the group of young customers who take very small loans for very short durations
- Cluster 3 is the group of middle-aged customers who take very large loans for longer durations
Analysing each cluster: Clusters 0 and 2 have good response rates, which indicates that users in these clusters are good borrowers and that more such users should be targeted. For Cluster 3, as expected, the response rate falls as the credit amount increases. However, since this segment is the most important to the bank, better due diligence should be followed. For Cluster 1, even though the credit amount is only slightly higher than the median value, the response rate is quite low. The loan disbursal process for this segment (younger customers wanting longer-duration loans) needs considerable improvement.
plot('job')
plot('present_emp')
plot('sex')
plot('credit_his')
Comments:
- Cluster 0 is a segment of older customers taking small loans for shorter durations. 74% of these customers are either skilled employees or unskilled residents. They mostly have 7+ years of work experience in their current jobs and are single males, and 86% of them either have existing credits paid duly or have other credits existing.
- Cluster 1 is a segment of young customers taking moderate loans for moderate durations. 71% of these customers are skilled employees. 60% of them have 1-7 years of work experience in their current jobs, they are a balanced mix of single males and married/divorced females, and 83% of them either have existing credits paid duly or have other credits existing.
- Cluster 2 is a segment of young customers taking very small loans for very short durations. 91% of these customers are either skilled employees or unskilled residents. 66% of them are just starting their jobs, with either <1 year or 1-4 years of work experience. They are a mix of single and married males and married/divorced females, and they have a credit history.
- Cluster 3 is a segment of middle-aged customers taking large loans for longer durations. There is a high proportion of self-employed customers here, evenly distributed across 1-7 years of work experience. 70% of them are single males, and they have a credit history, with a significant number having delayed paying off past credits.
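#Mean and count of amount and response for each cluster and purpose
#(group keys stay in the index so that reset_index() yields exactly six columns)
grouped2 = german_data.groupby(['Cluster','purpose'], observed=True)[['amount','response']].agg(['mean','count'])
grouped3 = grouped2.reset_index()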
grouped3.columns = ['Cluster', 'Purpose', 'Mean_Amount','Count', 'Mean_Response', 'Count_Response']
grouped4 = grouped3[['Cluster', 'Purpose', 'Mean_Amount','Count', 'Mean_Response']]
grouped4.head()
fig5 = px.scatter(grouped4, x="Mean_Response", y="Mean_Amount", size="Count", color="Cluster",
hover_name="Purpose", size_max=60)
fig5.update_layout(height=600, width=1000, title_text="Credit Worthiness of each segment")
warnings.filterwarnings('ignore')
fig5.show()
Please note: the X-axis here is Mean Response and the Y-axis is Mean Amount.
The size of each bubble indicates the count of users in the category.
The category (purpose) is shown when hovering over a bubble.
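The per-cluster, per-purpose table behind this chart can also be built with pandas named aggregation, which avoids renaming the columns by position. This is only a sketch under assumptions: the frame `toy` and its values are made up for illustration and are not the German credit data.

```python
import pandas as pd

# Hypothetical five-row frame with the same column names as the notebook.
toy = pd.DataFrame({
    "Cluster": [0, 0, 0, 1, 1],
    "purpose": ["car (used)", "car (used)", "business", "business", "business"],
    "amount": [4000, 6000, 3000, 2000, 4000],
    "response": [1, 1, 0, 1, 0],
})

# Named aggregation: each output column is declared as (input column, function).
summary = (
    toy.groupby(["Cluster", "purpose"], as_index=False)
       .agg(Mean_Amount=("amount", "mean"),
            Count=("amount", "count"),
            Mean_Response=("response", "mean"))
)
print(summary)
```

Because the output names are stated next to the functions that produce them, adding or reordering an aggregate does not silently shift the labels.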
Comments:
- Ideally, we would want the top right of the above chart to be more populated. To achieve this, below are some target areas the bank can focus on.
- Cluster 0: This segment should be targeted more with used-car loans. More marketing effort can go into used-car loans to increase the number of users in this segment. Further, this segment can be targeted with Radio/TV and new-car loans. If users from this segment want to take loans for business, more due diligence is needed to avoid these turning into bad loans.
- Cluster 1: This segment is of somewhat lower importance, since it has lower loan amounts and yet 34% defaulters. It can be targeted with lower loan amounts for used cars and business, with more due diligence for Radio/TV and Furniture loans.
- Cluster 2: For this segment, more marketing should be done for Business and Furniture loans to increase the number of users. Retraining, although a very small category, can also be considered a potential target for these users. A large portion of the users in this segment are also interested in new-car loans; with better due diligence, the response rate can be improved.
- Cluster 3: For this segment, more marketing should be done to attract Radio/TV and used-car buyers. This is an important segment for the business, since these customers take higher loan amounts, but it also has defaulters. Hence, due diligence needs to be improved for car, business and furniture loans in this segment to mitigate the risk.
Additionally, if the bank is looking to give more automobile loans (for new cars), it should target customers from Clusters 0 and 2 rather than Clusters 1 and 3.
For Education loans, Cluster 0, which has significantly better response rates, should be targeted; for Furniture/Equipment loans, Clusters 1 and 3 should be targeted rather than Clusters 0 and 2.
#Data manipulation libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
#Data visualization libraries
import plotly.express as px
import plotly.graph_objs as go
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt
import seaborn as sns
#K-means clustering libraries
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data"
credit = pd.read_csv(url, sep= " ", names = ["chk_acct", "duration", "credit_his", "purpose","amount",
"saving_acct", "present_emp", "installment_rate", "sex",
"other_debtor", "present_resid", "property", "age",
"other_install", "housing", "n_credits", "job", "n_people",
"telephone", "foreign","response"])
credit.head()
credit.info()
german_data = credit
german_data.head()
#Cleaning data
german_data.chk_acct.replace(['A11','A12','A13','A14'],['little', 'moderate','rich', 'no checking account'], inplace = True)
german_data.credit_his.replace(['A30','A31','A32','A33','A34'],['no credits taken',
'all credits paid duly',
'existing credits paid duly',
'delay in paying off in the past',
'other credits existing'], inplace=True)
german_data.purpose.replace(['A40','A41','A42','A43','A44','A45','A46','A47','A48','A49','A410'],
['car (new)','car (used)', 'furniture/equipment','radio/television',
'domestic appliances','repairs', 'education', 'vacation', 'retraining',
'business', 'others'], inplace=True)
german_data.saving_acct.replace(['A61','A62','A63','A64','A65'],
['little', 'moderate', 'rich','quite rich','no savings account'], inplace=True)
german_data.present_emp.replace(['A71','A72','A73','A74','A75'],
['unemployed', '< 1yr', '>=1yr & <4yr', '>=4yr & <7yr','>=7yr'], inplace=True)
german_data.sex.replace(['A91','A92','A93','A94','A95'],
['male : divorced', 'female : divorced/married',
'male : single', 'male : married/widowed', 'female : single'], inplace=True)
german_data.other_debtor.replace(['A101','A102','A103'],
['none', 'co-applicant', 'guarantor'], inplace=True)
german_data.property.replace(['A121','A122','A123','A124'],
['real estate', 'building/life insurance', 'car/others',
'unknown/no property'], inplace=True)
german_data.other_install.replace(['A141','A142','A143'],['bank', 'stores', 'none'], inplace=True)
german_data.housing.replace(['A151','A152','A153'],['rent','own','free'], inplace=True)
german_data.job.replace(['A171','A172','A173','A174'],
['unskilled - non-resident', 'unskilled - resident',
'skilled employee', 'self-employed'],inplace=True)
german_data.telephone.replace(['A191','A192'],['none', 'yes'], inplace=True)
german_data.foreign.replace(['A201','A202'],['yes','no'], inplace=True)
german_data.response.replace([1,2], [1,0], inplace=True)
german_data.head()
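The `replace(..., inplace=True)` calls above act on a column selected from the DataFrame, which newer pandas versions may treat as a copy, so the change might not reach `german_data`. A minimal sketch of an alternative keeps the code-to-label mappings in dictionaries and assigns the result back. The frame `sample` and the name `code_maps` are hypothetical and only cover three of the columns.

```python
import pandas as pd

# Hypothetical three-row sample using the raw UCI codes (not the real data).
sample = pd.DataFrame({
    "housing": ["A151", "A152", "A153"],
    "telephone": ["A191", "A192", "A191"],
    "response": [1, 2, 1],
})

# One dictionary per column, with the same labels as the cleaning step above.
code_maps = {
    "housing": {"A151": "rent", "A152": "own", "A153": "free"},
    "telephone": {"A191": "none", "A192": "yes"},
    "response": {1: 1, 2: 0},  # 1 = good debt, 0 = bad debt
}

# Assign the mapped column back instead of replacing in place.
for column, mapping in code_maps.items():
    sample[column] = sample[column].map(mapping)

print(sample)
```

Note that `map` returns NaN for any code missing from the dictionary, which makes an incomplete mapping easy to spot with `isnull().sum()`.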
#Checking missing values in the data
print("Missing values in each column:\n{}".format(german_data.isnull().sum()))
Comments: No missing values in the data
german_data[["amount", "duration", "age"]].describe().round(2)
Comments: The average credit amount taken by users is ~3.3K, with an average duration of about 20 months and an average age of about 36 years
There is very high variation in the credit amount
The large difference between the median and the mean of the credit amount indicates that amount is highly positively skewed, whereas duration and age are only slightly positively skewed
#Distribution Plots
fig = make_subplots(rows=1, cols=3)
trace0 = go.Histogram(x=german_data["amount"], name='Credit Amount Distribution')
trace1 = go.Histogram(x=german_data["duration"], name="Duration Distribution")
trace2 = go.Histogram(x=german_data["age"], name="Age Distribution")
fig.append_trace(trace0,1,1)
fig.append_trace(trace1,1,2)
fig.append_trace(trace2,1,3)
#Updating xaxes and yaxes
fig.update_xaxes(title_text="Credit Amount", row=1, col=1)
fig.update_xaxes(title_text="Duration", row=1, col=2)
fig.update_xaxes(title_text="Age", row=1, col=3)
fig.update_yaxes(title_text="Count", row=1, col=1)
fig.update_layout(height=500, width=1000, title_text="Distribution Plots")
fig.show()
Comments:
- The data shows that most credit amounts are between 1,500 and 4,000
- As expected, credit amount is positively skewed, which indicates that more people take smaller loans
- The duration and age distribution are also slightly positively skewed
#Distribution of credit amount across various reasons
fig2 = px.box(german_data, x="purpose", y="amount", color="response", title="Distribution of credit amount across various reasons")
fig2.update_layout(height=500, width=1000)
fig2.show()
Please Note:
Response 1 = Good Debt (in blue)
Response 0 = Bad Debt (in red)
Comments:
- To define a maximum loan amount that can help prevent bad debt, the maximum non-outlier value of good debt can be used as a benchmark for each category. For example, for Education, the maximum loan amount could be around 8K, after which the probability of it becoming a bad debt increases
- Users have the highest median amount for used cars. The spread of the used-car distribution is also wider than for all the other reasons.
- For automobile loans, customers looking for a used car should be targeted more, as they take loans of significantly larger amounts. However, more due diligence is needed, as used-car borrowers also default more.
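The "maximum non-outlier value" mentioned in the first comment corresponds to the upper whisker of a boxplot. A rough sketch of computing it per purpose is shown here, assuming the common Tukey rule (Q3 + 1.5 × IQR) and pandas' default quantile interpolation; Plotly's quartile method may differ slightly, and the `loans` values are invented for illustration.

```python
import pandas as pd

# Hypothetical good-debt loans (response == 1); the amounts are made up.
loans = pd.DataFrame({
    "purpose": ["education"] * 5 + ["car (used)"] * 5,
    "amount": [1000, 2000, 3000, 4000, 20000, 2000, 4000, 6000, 8000, 10000],
})

def upper_whisker(amounts: pd.Series) -> float:
    """Largest observation that does not exceed Q3 + 1.5 * IQR."""
    q1, q3 = amounts.quantile([0.25, 0.75])
    fence = q3 + 1.5 * (q3 - q1)
    return float(amounts[amounts <= fence].max())

# One benchmark per purpose, computed on good-debt loans only.
benchmarks = loans.groupby("purpose")["amount"].apply(upper_whisker)
print(benchmarks)
```

In this toy data the 20,000 education loan lies above the fence, so it is excluded and the benchmark falls back to the largest remaining amount.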
#Scatter plots
import plotly.figure_factory as ff
fig3 = ff.create_scatterplotmatrix(german_data[['age','duration','amount']], diag='box',
#index='index',
colormap='Portland',colormap_type='cat',height=700, width=700)
fig3.show()
Comments: As expected, there is a fairly strong positive correlation between duration and credit amount, with a correlation coefficient of about 0.62
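The scatter matrix shows the relationship visually; a coefficient such as the one quoted here can be obtained with `DataFrame.corr`. The sketch below uses synthetic `duration` and `amount` values (an assumption, not the notebook's data), so its printed value will not match the figure above.

```python
import numpy as np
import pandas as pd

# Synthetic data in which amount grows with duration plus noise.
rng = np.random.default_rng(1)
duration = rng.integers(6, 61, size=500)
amount = 150 * duration + rng.normal(0, 2000, size=500)
demo = pd.DataFrame({"duration": duration, "amount": amount})

# Pearson correlation matrix of the two columns.
corr = demo[["duration", "amount"]].corr(method="pearson")
print(corr.round(2))
```

On the real data, the same call on `german_data[['age', 'duration', 'amount']]` would give the full matrix for the three numeric variables.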
# Data for K-means clustering
german_data_cluster = german_data[['age', 'amount', 'duration']]
print("Original variables:\n{}" .format(german_data_cluster.head()))
german_data_cluster_tr = np.log(german_data_cluster)
print("Log transformed variables:\n{}" .format(german_data_cluster_tr.head()))
Comments: As seen in the distribution plots, our variables are skewed. Hence, we apply a logarithmic transformation to the variables to reduce the skewness.
#Distribution of transformed variable
fig, ax = plt.subplots(1,3,figsize=(20,5))
plt.suptitle('DISTRIBUTION PLOTS OF TRANSFORMED VARIABLES')
sns.distplot(german_data_cluster_tr['amount'], bins=40, ax=ax[0], axlabel="Credit Amount");
sns.distplot(german_data_cluster_tr['duration'], bins=40, ax=ax[1], color='salmon', axlabel="Duration");
sns.distplot(german_data_cluster_tr['age'], bins=40, ax=ax[2], color='darkviolet', axlabel="Age");
Comments: The transformed variables now look much closer to normally distributed
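The effect of the log transformation can also be checked numerically by comparing sample skewness before and after. This is a sketch on synthetic log-normal columns (an assumption made so the example is self-contained); `raw`, `logged` and `skew_table` are illustrative names.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed columns standing in for amount and duration.
rng = np.random.default_rng(2)
raw = pd.DataFrame({
    "amount": rng.lognormal(8.0, 0.8, size=1000),
    "duration": rng.lognormal(3.0, 0.5, size=1000),
})

# Same transformation as in the notebook.
logged = np.log(raw)

# Skewness near 0 suggests a roughly symmetric distribution.
skew_table = pd.DataFrame({"raw": raw.skew(), "log": logged.skew()})
print(skew_table.round(2))
```

A skewness close to zero supports symmetry only; it does not by itself establish normality.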
scaler = StandardScaler()
german_data_cluster_scaled = scaler.fit_transform(german_data_cluster_tr)
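K-means relies on Euclidean distance, so variables on larger scales would otherwise dominate the clustering; `StandardScaler` puts each column on zero mean and unit variance. A small sketch on synthetic features (assumed values, not the notebook's data) confirms what the scaler produces:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic log-scale features with different means and spreads.
rng = np.random.default_rng(3)
features = rng.normal(loc=[3.5, 7.8, 2.9], scale=[0.3, 0.8, 0.6], size=(200, 3))

scaled = StandardScaler().fit_transform(features)

# Each column should now have mean ~0 and (population) standard deviation ~1.
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```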
#Using the elbow method to define the optimal number of clusters
distortions = []
K = range(1,10)
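for k in K:
    #Fit K-means for each candidate k and record the inertia (within-cluster sum of squares)
    kmeanModel = KMeans(n_clusters=k, random_state=0)
    kmeanModel.fit(german_data_cluster_scaled)
    distortions.append(kmeanModel.inertia_)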
#Plotting the elbow method
plt.figure(figsize=(10,5))
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
Comments: There is hardly any change in distortion after k=4, hence we will consider the optimal number of clusters to be 4
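The elbow is read off the plot by eye. As a cross-check, and as a different technique from the one used in this notebook, the silhouette score can be computed for several values of k. The sketch below runs on four synthetic, well-separated blobs (an assumption chosen so the answer is unambiguous); on the real data the peak may be much less clear.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Four synthetic, well-separated 2-D blobs of 50 points each.
rng = np.random.default_rng(4)
centers = np.array([[0, 0], [6, 0], [0, 6], [6, 6]])
points = np.vstack([c + rng.normal(0, 0.5, size=(50, 2)) for c in centers])

# Silhouette score needs at least 2 clusters, so start at k=2.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

best_k = max(scores, key=scores.get)
print({k: round(v, 3) for k, v in scores.items()}, "best k:", best_k)
```

A higher silhouette score means points sit closer to their own cluster than to the nearest other cluster, so agreement between the two methods adds some confidence in the chosen k.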
# k-means algorithm
k = 4
kmeans = KMeans(n_clusters=k, random_state=0).fit(german_data_cluster_scaled)
#Adding the clusters back to the data
german_data['Cluster'] = kmeans.labels_
german_data['Cluster'] = german_data['Cluster'].astype('category')
german_data.head()
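Because the model is fitted on log-transformed, standardised values, `cluster_centers_` is not expressed in years, months or currency units. One way to read the centroids, sketched here on synthetic data with hypothetical names (`toy`, `toy_scaler`, `toy_kmeans`), is to undo the scaling and then the log. The result is a geometric-mean-like centre, not the arithmetic mean of each cluster.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the three clustering variables.
rng = np.random.default_rng(5)
toy = pd.DataFrame({
    "age": rng.integers(19, 76, size=300),
    "amount": rng.lognormal(8.0, 0.8, size=300),
    "duration": rng.integers(4, 73, size=300),
})

# Same pipeline as the notebook: log, standardise, then K-means with k=4.
toy_scaler = StandardScaler()
toy_scaled = toy_scaler.fit_transform(np.log(toy))
toy_kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(toy_scaled)

# Undo the scaling first, then undo the log.
centers_original = pd.DataFrame(
    np.exp(toy_scaler.inverse_transform(toy_kmeans.cluster_centers_)),
    columns=toy.columns,
)
print(centers_original.round(1))
```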
#3D scatter plot of the cluster
fig4 = px.scatter_3d(german_data, x='age', y='duration', z='amount',color='Cluster')
fig4.update_layout(height=500, width=700, title_text="3D Map of Cluster")
fig4.show()
#Cluster sizes
cluster_size = german_data.groupby('Cluster', as_index=True).size()
cluster_size
Comments: The customers are divided fairly evenly across the clusters
#Countplots to identify segments of customers
sns.set_style('white')
fig, ax = plt.subplots(3,2,figsize=(21,20))
plt.suptitle('COUNT PLOTS',fontsize=15)
sns.countplot(x=german_data['job'], ax=ax[0][0], palette=sns.color_palette('RdBu'))
sns.countplot(x=german_data['housing'], ax=ax[0][1], palette=sns.color_palette('RdBu'))
sns.countplot(x=german_data['saving_acct'], ax=ax[1][0], palette=sns.color_palette('BuGn_r'))
sns.countplot(x=german_data['chk_acct'], ax=ax[1][1],palette=sns.color_palette('BuGn_r')[4:])
sns.countplot(x=german_data['purpose'], ax=ax[2][0], palette=sns.color_palette('RdBu_r'))
sns.countplot(x=german_data['sex'], ax=ax[2][1],palette=sns.color_palette('RdBu_r'))
ax[2][0].tick_params(labelrotation=45)
ax[0][0].set(xlabel="Job", ylabel="Count")
ax[0][1].set(xlabel="Housing", ylabel="Count")
ax[1][0].set(xlabel="Saving Accounts", ylabel="Count")
ax[1][1].set(xlabel="Checking Accounts", ylabel="Count")
ax[2][0].set(xlabel="Purpose", ylabel="Count")
ax[2][1].set(xlabel="Sex and Status", ylabel="Count")
#Distribution of age in each cluster
cluster0 = german_data[german_data['Cluster']==0]
cluster1 = german_data[german_data['Cluster']==1]
cluster2 = german_data[german_data['Cluster']==2]
cluster3 = german_data[german_data['Cluster']==3]
fig, ax = plt.subplots(4,1,figsize=(10,6), constrained_layout=True, sharex=True)
ax[0].title.set_text('Cluster 0')
ax[1].title.set_text('Cluster 1')
ax[2].title.set_text('Cluster 2')
ax[3].title.set_text('Cluster 3')
ax[0].axes.xaxis.set_visible(False)
ax[1].axes.xaxis.set_visible(False)
ax[2].axes.xaxis.set_visible(False)
sns.distplot(cluster0['age'], color='darkcyan', bins=10, ax=ax[0])
sns.distplot(cluster1['age'], color='steelblue', bins=10, ax=ax[1])
sns.distplot(cluster2['age'], color='sandybrown', bins=10, ax=ax[2])
sns.distplot(cluster3['age'], color='indianred', bins=10, ax=ax[3])
plt.xlabel('Age', fontsize=20)
#Distribution of credit amounts in each cluster
fig, ax = plt.subplots(4,1,figsize=(10,6), constrained_layout=True, sharex=True)
ax[0].title.set_text('Cluster 0')
ax[1].title.set_text('Cluster 1')
ax[2].title.set_text('Cluster 2')
ax[3].title.set_text('Cluster 3')
ax[0].axes.xaxis.set_visible(False)
ax[1].axes.xaxis.set_visible(False)
ax[2].axes.xaxis.set_visible(False)
sns.distplot(cluster0['amount'], color='darkcyan', bins=10, ax=ax[0])
sns.distplot(cluster1['amount'], color='steelblue', bins=10, ax=ax[1])
sns.distplot(cluster2['amount'], color='sandybrown', bins=10, ax=ax[2])
sns.distplot(cluster3['amount'], color='indianred', bins=10, ax=ax[3])
plt.xlabel('Credit Amount', fontsize=20)
#Distribution of duration in each cluster
fig, ax = plt.subplots(4,1,figsize=(10,6), constrained_layout=True, sharex=True)
ax[0].title.set_text('Cluster 0')
ax[1].title.set_text('Cluster 1')
ax[2].title.set_text('Cluster 2')
ax[3].title.set_text('Cluster 3')
ax[0].axes.xaxis.set_visible(False)
ax[1].axes.xaxis.set_visible(False)
ax[2].axes.xaxis.set_visible(False)
sns.distplot(cluster0['duration'], color='darkcyan', bins=10, ax=ax[0])
sns.distplot(cluster1['duration'], color='steelblue', bins=10, ax=ax[1])
sns.distplot(cluster2['duration'], color='sandybrown', bins=10, ax=ax[2])
sns.distplot(cluster3['duration'], color='indianred', bins=10, ax=ax[3])
plt.xlabel('Duration', fontsize=20)
#Defining a function to plot the share of each category within each cluster
def get_df(data):
    #Proportion of each category, indexed by the category label
    out = data.value_counts(normalize=True)
    return out
def plot(x):
    clusters = [('Cluster 0', cluster0, 'mediumaquamarine'),
                ('Cluster 1', cluster1, 'steelblue'),
                ('Cluster 2', cluster2, 'sandybrown'),
                ('Cluster 3', cluster3, 'indianred')]
    fig = go.Figure()
    for name, data, color in clusters:
        #Read labels and shares directly from the Series, so the code does not
        #depend on the column names produced by reset_index()
        shares = get_df(data[x])
        fig.add_trace(go.Bar(x=shares.index, y=shares.values, name=name, marker_color=color))
    fig.update_layout(barmode='group', xaxis_tickangle=45, title=x)
    fig.show()
plot('saving_acct')
plot('installment_rate')
plot('housing')
plot('n_credits')
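The grouped bar charts show, for each cluster, the share of customers in each category. The same shares can be tabulated with `pd.crosstab` and `normalize="index"`, which is a convenient way to read off percentages such as those quoted in the summary comments. The frame `toy` below is a made-up example, not the clustered data.

```python
import pandas as pd

# Hypothetical cluster labels and housing categories.
toy = pd.DataFrame({
    "Cluster": [0, 0, 0, 0, 1, 1],
    "housing": ["own", "own", "own", "rent", "rent", "free"],
})

# normalize="index" makes each row (cluster) sum to 1.
shares = pd.crosstab(toy["Cluster"], toy["housing"], normalize="index")
print(shares.round(2))
```

On the notebook's data the equivalent call would be `pd.crosstab(german_data['Cluster'], german_data['housing'], normalize='index')`.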